Description

FIELD OF THE INVENTION 

[0001] The present invention relates to digital microprocessors, and in particular but not exclusively, to
microprocessors configurable to repeat program flow.

BACKGROUND TO THE INVENTION 

[0002] Many different types of processors are known, of which microprocessors are but one example. For example,
Digital Signal Processors (DSPs) are widely used, in particular for specific applications, such as mobile processing
applications. DSPs are typically configured to optimize the performance of the applications concerned and to achieve
this they employ more specialized execution units and instruction sets. Particularly in, but not exclusively, applications
such as mobile telecommunications applications, it is desirable to provide ever increasing DSP performance while
keeping power consumption as low as possible.

[0003] In a DSP or microprocessor, machine-readable instructions stored in a program memory are sequentially
executed by the processor in order for the processor to perform operations or functions. The sequence of
machine-readable instructions is termed a "program". Although the program instructions are typically performed sequentially,
certain instructions permit the program sequence to be broken, and for the program flow to repeat a block of instructions.
Such repetition of a block of instructions is known as "looping" and the block of instructions is known as a "loop" or
a "block."

[0004] In order to reduce power consumption, many microprocessors provide a low power mode in which the clock 
is slowed during times of inactivity, or certain peripheral devices are turned off when not needed. The processor may 
enter an "idle" mode or a "sleep" mode until an interrupt occurs to restart full operation.

SUMMARY OF THE INVENTION



[0005] The teachings of the present application disclose a digital system, and a method of operating a digital system,
to further reduce power consumption by microprocessors such as, for example but not exclusively, digital signal
processors.

[0006] In particular a method for operating a digital system that includes a microprocessor is disclosed. A portion of 
the microprocessor is partitioned into a plurality of partitions. The microprocessor executes a sequence of instructions 
within an instruction pipeline of the microprocessor, and repetitively executes a block of instructions within the sequence 
of instructions. It is determined that at least one of the plurality of partitions is not needed to execute the block of 
instructions. In order to reduce power dissipation, operation of the unneeded partition(s) is inhibited while the block of 
instructions is repetitively executed. 

[0007] A repeat profile parameter may be provided that is indicative of the partition(s) not needed to execute the
block of instructions. The repeat profile parameter is preferably provided by an instruction executed prior to the block
of instructions. The repeat profile parameter may be determined by monitoring execution of a first iteration of the block
of instructions and thereby deriving the repeat profile parameter. Separate repeat profile parameters may be provided
for an inner loop and an outer loop.

[0008] Preferably, an interrupt occurring during execution of the block of instructions causes the partition inhibition
to be masked, so that all partitions of the microprocessor are enabled during execution of the interrupt service routine
(ISR). The partition inhibition is unmasked on returning to repetitive execution of the block of instructions after
execution of the ISR is completed.

[0009] Various portions of the microprocessor can be partitioned and partially inhibited during execution of a block
of instructions. For example, the instruction decoder is partitioned according to groups of instructions. The instruction
register is partitioned according to various instruction lengths. The instruction pipeline is partitioned according to parallel
instruction execution. A portion of the microprocessor is partitioned according to data types. Address generation
circuitry is partitioned according to address modes. Status circuitry is inhibited if not required during execution of the
block of instructions.

[0010] A method for assembling a source code program to create a sequence of instructions is also disclosed. The
sequence of instructions has a repeatable block of instructions including an initial instruction and a final instruction. An
instruction table is created with an entry for each instruction executable by a selected microprocessor, such that the
entry for each instruction includes a group pattern defining a group of instructions that includes that instruction. The
source code is transformed into a sequence of instructions, and the initial instruction and the final instruction are
determined for a repeatable block of instructions associated with a prologue instruction. A plurality of group patterns selected
from the instruction table, representative of each instruction in the block of instructions, are combined to form a repeat
profile parameter, and the repeat profile parameter is associated with the prologue instruction in the sequence of
instructions.
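By way of illustration only, the combining step of paragraph [0010] can be sketched in Python as follows. The mnemonics, the five one-hot group patterns, and the names `INSTRUCTION_TABLE`, `build_repeat_profile` and `attach_profile` are hypothetical, and the use of a bitwise OR to combine group patterns is an assumption; the text does not fix a particular encoding.

```python
# Hypothetical instruction table: each mnemonic maps to a one-hot group pattern.
# The mnemonics and the five-group layout are assumptions for illustration only.
INSTRUCTION_TABLE = {
    "mac":  0b00001,
    "add":  0b00010,
    "sub":  0b00010,
    "ld":   0b00100,
    "st":   0b00100,
    "shl":  0b01000,
    "goto": 0b10000,
}


def build_repeat_profile(block):
    """Combine (here: bitwise OR) the group patterns of every instruction in the block."""
    profile = 0
    for mnemonic in block:
        profile |= INSTRUCTION_TABLE[mnemonic]
    return profile


def attach_profile(prologue, profile):
    """Associate the repeat profile parameter with the prologue instruction."""
    return (prologue, profile)


loop_body = ["ld", "mac", "mac", "st"]          # from initial to final instruction
profile = build_repeat_profile(loop_body)       # groups needed by the block
prologue = attach_profile("localrepeat", profile)
```

In this sketch a set bit marks a group that the block needs; the partitions whose bits remain clear are the candidates for inhibition while the block repeats.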

[0011] Partitioning of the instruction decoder for several instruction groups allows one or more of the decoder
partitions to remain idle during execution of an instruction loop. Consequently, there is a corresponding reduction in power
consumption by the microprocessor.

[0012] Therefore, the teachings of the present application are particularly suitable for use in portable apparatus such
as wireless communication devices. Typically such a wireless communication device comprises a user interface
including a display, and a keypad or keyboard for inputting data to the communication device. A wireless
communication device will also comprise an antenna for wireless communication with a radio telephone network or the like.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0013] The present invention will now be described in detail, by way of example, with reference to certain exemplary 
embodiments illustrated in the accompanying drawings, in which: 

Figure 1 is a schematic block diagram of a processor;
Figure 2 is a schematic illustration of a wireless communication device;
Figure 3 is a schematic diagram of a core of the processor;
Figure 4 is a more detailed schematic block diagram of various execution units of the core of the processor;
Figure 5 is a schematic diagram of an instruction buffer queue and an instruction decoder controller of the processor;
Figure 6 is a representation of pipeline stages of the processor;
Figure 7 is a diagrammatic illustration of an example of operation of a pipeline in the processor;
Figure 8 is a schematic representation of the core of the processor for explaining the operation of the pipeline;
Figure 9 is an illustration of grouping within an instruction set of the processor;
Figure 10 is a block diagram illustrating the instruction execution pipeline of the processor in more detail, including
partitions of the instruction decoder;
Figure 11 is a block diagram illustrating the block repeat control circuitry of the processor in more detail, including
a repeat profile register and mask;
Figure 12 is a block diagram of an alternative embodiment illustrating the block repeat control circuitry of the
processor in more detail, including an instruction register for variable size instruction words;
Figure 13 is a timing diagram illustrating operation of repeat profiles during execution of a nested loop by the
processor;
Figure 14 is a flow chart illustrating various steps involved in repetitively executing a block of instructions in the
processor using a repeat profile parameter;
Figure 15 is a block diagram illustrating monitoring circuitry for determining a profile during execution of a block
of instructions by the processor;
Figure 16 is a timing diagram illustrating operation of the monitoring circuitry of Figure 15 during execution of a
block of instructions by the processor;
Figure 17 is a flow chart illustrating various steps involved for forming a repeat profile parameter by an assembler
by determining what partitions will be needed during execution of a block of instructions; and
Figure 18 is a timing diagram illustrating execution of a local loop instruction in the instruction execution
pipeline of the processor.

DETAILED DESCRIPTION OF THE INVENTION



[0014] Although the teachings of the present application find particular application in Digital Signal Processors
(DSPs), implemented for example in an Application Specific Integrated Circuit (ASIC), these teachings may also find
application in other forms of microprocessors.

[0015] Figure 1 is a block diagram of a microprocessor 10 that is a digital signal processor ("DSP"). In the interest
of clarity, Figure 1 only shows those portions of microprocessor 10 that are relevant to the teachings of the present
application. Details of general construction for DSPs are well known, and may be found readily elsewhere. For example,
U.S. Patent 5,072,418 issued to Frederick Boutaud, et al., describes a DSP in detail. U.S. Patent 5,329,471 issued to
Gary Swoboda, et al., describes in detail how to test and emulate a DSP. Details of portions of microprocessor 10
relevant to an embodiment disclosed by the present application are explained in sufficient detail hereinbelow, so as to
enable one of ordinary skill in the microprocessor art to make and use it.

[0016] Partitioning of a portion of the processor allows one or more of the partitions to remain idle during execution
of an instruction loop. Consequently, there is a corresponding reduction in power consumption by the microprocessor.
Therefore, the disclosed embodiments are particularly suitable for use in portable apparatus, such as wireless
communication devices. Several example systems that can benefit from aspects of the present teachings are described in
U.S. Patent 5,072,418, particularly with reference to Figures 2-18 of U.S. Patent 5,072,418. A microprocessor
incorporating an aspect of the present teachings for reducing power consumption can be used to further improve the systems
described in U.S. Patent 5,072,418. Such systems include, but are not limited to, industrial process controls, automotive
vehicle systems, motor controls, robotic control systems, satellite telecommunication systems, echo canceling systems,
modems, video imaging systems, speech recognition systems, vocoder-modem systems with encryption, and such.

[0017] Figure 2 illustrates an exemplary implementation of a digital system embodying aspects of the present
teachings in a mobile telecommunications device, such as a mobile telephone with integrated keyboard 12 and display 14.
Digital signal processor 10 embodying aspects of the present teachings is packaged in an integrated circuit 40 that is
connected to keyboard 12, where appropriate via a keyboard adapter (not shown), to the display 14, where appropriate
via a display adapter (not shown), and to radio frequency (RF) circuitry 16. The RF circuitry 16 is connected to an aerial
18. Integrated circuit 40 includes a plurality of contacts for surface mounting. However, the integrated circuit could
include other configurations, for example a plurality of pins on a lower surface of the circuit for mounting in a zero
insertion force socket, or indeed any other suitable configuration.

[0018] The basic architecture of an example of a processor according to the present teachings will now be described. 
[0019] Referring again to Figure 1, microprocessor 10 includes a central processing unit (CPU) 100 and a processor
backplane 20. In the present embodiment, the processor is a Digital Signal Processor (DSP) implemented in an
Application Specific Integrated Circuit (ASIC).

[0020] As shown in Figure 1, central processing unit 100 includes a processor core 102 and a memory interface, or
memory management unit, 104 for interfacing the processor core 102 with memory units external to the processor core.

[0021] Processor backplane 20 comprises a backplane bus 22, to which the memory management unit 104 of the
microprocessor is connected. Also connected to the backplane bus 22 are an instruction cache memory 24, peripheral
devices 26 and an external interface 28.

[0022] It will be appreciated that in other embodiments, alternative implementations may use different configurations
and/or different technologies. For example, CPU 100 alone could form processor 10, with processor backplane 20
being separate therefrom. CPU 100 could, for example, be a DSP separate from and mounted on a backplane 20
supporting a backplane bus 22, peripheral and external interfaces. Microprocessor 100 could, for example, be a
microprocessor other than a DSP and could be implemented in technologies other than ASIC technology. The
microprocessor, or a processor including the processing engine, could be implemented in one or more integrated circuits.

[0023] Figure 3 illustrates the basic structure of an embodiment of the processing core 102. As illustrated, the
processing core 102 includes four elements, namely an Instruction Buffer Unit (I Unit) 106 and three execution units.
The execution units are a Program Flow Unit (P Unit) 108, Address Data Flow Unit (A Unit) 110 and a Data Computation
Unit (D Unit) 112 for executing instructions decoded from the Instruction Buffer Unit (I Unit) 106 and for controlling and
monitoring program flow.

[0024] Figure 4 illustrates the P Unit 108, A Unit 110 and D Unit 112 of the processing core 102 in more detail and
shows the bus structure connecting the various elements of the processing core 102. The P Unit 108 includes, for
example, loop control circuitry, GoTo/Branch control circuitry and various registers for controlling and monitoring
program flow such as repeat counter registers and interrupt mask, flag or vector registers. The P Unit 108 is coupled to
general purpose Data Write busses (EB, FB) 130, 132, Data Read busses (CB, DB) 134, 136 and an address constant
bus (KAB) 142. Additionally, the P Unit 108 is coupled to sub-units within the A Unit 110 and D Unit 112 via various
busses labeled CSR, ACB and RGD.

[0025] As illustrated in Figure 4, in the present embodiment, the A Unit 110 includes a register file 30, a data address
generation sub-unit (DAGEN) 32 and an Arithmetic and Logic Unit (ALU) 34. The A Unit register file 30 includes various
registers, among which are 16-bit pointer registers (AR0-AR7) and data registers (DR0-DR3) which may also be used
for data flow as well as address generation. Additionally, the register file includes 16-bit circular buffer registers and
7-bit data page registers. As well as the general purpose busses (EB, FB, CB, DB) 130, 132, 134, 136, a data constant
bus 140 and address constant bus 142 are coupled to the A Unit register file 30. The A Unit register file 30 is coupled
to the A Unit DAGEN unit 32 by unidirectional busses 144 and 146 respectively operating in opposite directions. The
DAGEN unit 32 includes 16-bit X/Y registers and coefficient and stack pointer registers, for example for controlling and
monitoring address generation within microprocessor 100.

[0026] The A Unit 110 also comprises the ALU 34 which includes a shifter function as well as the functions typically
associated with an ALU such as addition, subtraction, and AND, OR and XOR logical operators. The ALU 34 is also
coupled to the general-purpose busses (EB, DB) 130, 136 and an instruction constant data bus (KDB) 140. The A Unit
ALU is coupled to the P Unit 108 by a PDA bus for receiving register content from the P Unit 108 register file. The ALU
34 is also coupled to the A Unit register file 30 by busses RGA and RGB for receiving address and data register contents
and by a bus RGD for forwarding address and data registers in the register file 30.

[0027] As illustrated, the D Unit 112 includes a D Unit register file 36, a D Unit ALU 38, a D Unit shifter 40 and two
multiply and accumulate units (MAC1, MAC2) 42 and 44. The D Unit register file 36, D Unit ALU 38 and D Unit shifter
40 are coupled to busses (EB, FB, CB, DB and KDB) 130, 132, 134, 136 and 140, and the MAC units 42 and 44 are
coupled to the busses (CB, DB, KDB) 134, 136, 140 and data read bus (BB) 144. The D Unit register file 36 includes
40-bit accumulators (AC0-AC3) and a 16-bit transition register. The D Unit 112 can also utilize the 16-bit pointer and
data registers in the A Unit 110 as source or destination registers in addition to the 40-bit accumulators. The D Unit
register file 36 receives data from the D Unit ALU 38 and MACs 1&2 42, 44 over accumulator write busses (ACW0,
ACW1) 146, 148, and from the D Unit shifter 40 over accumulator write bus (ACW1) 148. Data is read from the D Unit
register file accumulators to the D Unit ALU 38, D Unit shifter 40 and MACs 1&2 42, 44 over accumulator read busses
(ACR0, ACR1) 150, 152. The D Unit ALU 38 and D Unit shifter 40 are also coupled to sub-units of the A Unit 110 via
various busses labeled EFC, DRB, DR2 and ACB.

[0028] Referring now to Figure 5, there is illustrated an instruction buffer unit 106 comprising a 32 word instruction
buffer queue (IBQ) 502. The IBQ 502 comprises 32 x 16-bit registers 504, logically divided into 8-bit bytes 506.
Instructions arrive at the IBQ 502 via the 32-bit program bus (PB) 122. The instructions are fetched in a 32-bit cycle into the
location pointed to by the Local Write Program Counter (LWPC) 532. The LWPC 532 is contained in a register located
in the P Unit 108. The P Unit 108 also includes the Local Read Program Counter (LRPC) 536 register, and the Write
Program Counter (WPC) 530 and Read Program Counter (RPC) 534 registers. LRPC 536 points to the location in the
IBQ 502 of the next instruction or instructions to be loaded into the instruction decoder(s) 512 and 514. That is to say,
the LRPC 536 points to the location in the IBQ 502 of the instruction currently being dispatched to the decoders 512,
514. The WPC 530 points to the address in program memory of the start of the next four bytes of instruction code for the
pipeline. For each fetch into the IBQ, the next four bytes from the program memory are fetched regardless of instruction
boundaries. The RPC 534 points to the address in program memory of the instruction currently being dispatched to
the decoder(s) 512 and 514.

[0029] The instructions are formed into a 48-bit word and are loaded into instruction register 522 and thence to
instruction decoders 512, 514 over a 48-bit bus 516 via multiplexors 520 and 521. It will be apparent to a person of
ordinary skill in the art that the instructions may be formed into words comprising other than 48 bits, and that the present
teachings are not limited to the specific embodiment described above.

[0030] The bus 516 can load a maximum of two instructions, one per decoder, during any one instruction cycle. The
combination of instructions may be in any combination of formats, 8, 16, 24, 32, 40 and 48 bits, which will fit across
the 48-bit bus. Decoder 1, 512, is loaded in preference to decoder 2, 514, if only one instruction can be loaded during
a cycle. The respective instructions are then forwarded on to the respective function units in order to execute them
and to access the data for which the instruction or operation is to be performed. Prior to being passed to the instruction
decoders, the instructions are aligned on byte boundaries. The alignment is done based on the format derived for the
previous instruction during decoding thereof. The multiplexing associated with the alignment of instructions with byte
boundaries is performed in multiplexors 520 and 521.
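By way of illustration only, the size rule of paragraph [0030] can be sketched as follows. The function name `dispatch` and the dictionary keys are hypothetical, and only the 48-bit width constraint is modelled; any further pairing conditions (for example the parallel enable bit or resource conflicts) are deliberately left out.

```python
BUS_WIDTH_BITS = 48
VALID_FORMATS = (8, 16, 24, 32, 40, 48)  # instruction formats named in the text


def dispatch(first_bits, second_bits=None):
    """Return which decoder receives which instruction in one cycle.

    Decoder 1 is always loaded first; decoder 2 is loaded only if both
    instructions fit together across the 48-bit bus.
    """
    for size in (first_bits, second_bits):
        if size is not None and size not in VALID_FORMATS:
            raise ValueError(f"unsupported instruction format: {size} bits")
    loaded = {"decoder1": first_bits, "decoder2": None}
    if second_bits is not None and first_bits + second_bits <= BUS_WIDTH_BITS:
        loaded["decoder2"] = second_bits
    return loaded
```

For example, a 16-bit and a 32-bit instruction fill the bus exactly and are dispatched together, whereas a 32-bit instruction followed by a 24-bit instruction leaves decoder 2 empty for that cycle.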
[0031] The processor core 102 executes instructions through a seven stage instruction execution pipeline, the
respective stages of which will now be described with reference to Figure 6.

[0032] The first stage of the pipeline is a PRE-FETCH (P0) stage 202, during which stage a next program memory
location is addressed by asserting an address on the address bus (PAB) 118 of a memory interface, or memory
management unit 104.

[0033] In the next stage, FETCH (P1) stage 204, the program memory is read and the I Unit 106 is filled via the PB
bus 122 from the memory management unit 104.

[0034] The PRE-FETCH and FETCH stages are separate from the rest of the pipeline stages in that the pipeline can 
be interrupted during the PRE-FETCH and FETCH stages to break the sequential program flow and point to other 
instructions in the program memory, for example for a Branch instruction. 

[0035] The next instruction in the instruction buffer is then dispatched to the decoder(s) 512, 514 in the third stage,
DECODE (P2) 206, where the instruction is decoded and dispatched to the execution unit for executing that instruction,
for example to the P Unit 108, the A Unit 110 or the D Unit 112. The decode stage 206 includes decoding at least part
of an instruction including a first part indicating the class of the instruction, a second part indicating the format of the
instruction and a third part indicating an addressing mode for the instruction.

[0036] The next stage is an ADDRESS (P3) stage 208, in which the address of the data to be used in the instruction
is computed, or a new program address is computed should the instruction require a program branch or jump. The
respective computations take place in the A Unit 110 or the P Unit 108.

[0037] In an ACCESS (P4) stage 210 the address of a read operand is output and the memory operand, the address
of which has been generated in a DAGEN X operator with an Xmem indirect addressing mode, is then READ from
indirectly addressed X memory (Xmem).

[0038] The next stage of the pipeline is the READ (P5) stage 212 in which a memory operand, the address of which
has been generated in a DAGEN Y operator with a Ymem indirect addressing mode or in a DAGEN C operator with
coefficient address mode, is READ. The address of the memory location to which the result of the instruction is to be
written is output.

[0039] In the case of dual access, read operands can also be generated in the Y path, and write operands in the X 
path. 

[0040] Finally, there is an execution EXEC (P6) stage 214 in which the instruction is executed in either the A Unit
110 or the D Unit 112. The result is then stored in a data register or accumulator or written to memory for
Read/Modify/Write or store instructions. Additionally, shift operations are performed on data in accumulators during the EXEC stage.

[0041] The basic principle of operation for a pipeline processor will now be described with reference to Figure 7. As
can be seen from Figure 7, for a first instruction 302, the successive pipeline stages take place over time periods
T1-T7. Each time period is a clock cycle for the processor machine clock. A second instruction 304 can enter the pipeline
in period T2, since the previous instruction has now moved on to the next pipeline stage. For instruction 3, 306, the
PRE-FETCH stage 202 occurs in time period T3. As can be seen from Figure 7, for a seven stage pipeline a total of
seven instructions may be processed simultaneously. For all seven instructions 302-314, Figure 7 shows them all under
process in time period T7. Such a structure adds a form of parallelism to the processing of instructions.
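By way of illustration only, the stage occupancy described with reference to Figure 7 can be sketched as follows. The sketch assumes an ideal pipeline with no stalls or flushes, numbers instructions from 0 (so that instruction 302 is index 0 and instruction 314 is index 6) and time periods from 1; the function names are hypothetical.

```python
# The seven stages P0..P6 in pipeline order, as named in the description.
STAGES = ["PRE-FETCH", "FETCH", "DECODE", "ADDRESS", "ACCESS", "READ", "EXEC"]


def stage_at(instruction_index, period):
    """Stage occupied by an instruction during period T<period>, or None.

    Instruction i enters PRE-FETCH in period T(i+1) and advances one stage
    per clock cycle (ideal pipeline, no stalls: an assumption).
    """
    stage_index = period - 1 - instruction_index
    if 0 <= stage_index < len(STAGES):
        return STAGES[stage_index]
    return None


def in_flight(period, n_instructions):
    """Indices of the instructions that occupy some stage during the period."""
    return [i for i in range(n_instructions) if stage_at(i, period) is not None]
```

With seven instructions, `in_flight(7, 7)` lists all seven indices: the first instruction is in EXEC while the seventh is in PRE-FETCH, which is the situation Figure 7 shows for period T7.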
[0042] As shown in Figure 8, the disclosed embodiment includes a memory management unit 104 which is coupled
to external memory units (not shown) via a 24-bit address bus 114 and a bi-directional 16-bit data bus 116. Additionally,
the memory management unit 104 is coupled to program storage memory (not shown) via a 24-bit address bus 118
and a 32-bit bi-directional data bus 120. The memory management unit 104 is also coupled to the I Unit 106 of the
machine processor core 102 via a 32-bit program read bus (PB) 122. The P Unit 108, A Unit 110 and D Unit 112 are
coupled to the memory management unit 104 via data read and data write busses and corresponding address busses.
The P Unit 108 is further coupled to a program address bus 128.

[0043] More particularly, the P Unit 108 is coupled to the memory management unit 104 by a 24-bit program address
bus 128, the two 16-bit data write busses (EB, FB) 130, 132, and the two 16-bit data read busses (CB, DB) 134, 136.
The A Unit 110 is coupled to the memory management unit 104 via two 24-bit data write address busses (EAB, FAB)
160, 162, the two 16-bit data write busses (EB, FB) 130, 132, the three data read address busses (BAB, CAB, DAB)
164, 166, 168 and the two 16-bit data read busses (CB, DB) 134, 136. The D Unit 112 is coupled to the memory
management unit 104 via the two data write busses (EB, FB) 130, 132 and three data read busses (BB, CB, DB) 144,
134, 136.

[0044] Figure 8 represents the passing of instructions from the I Unit 106 to the P Unit 108 at 124, for forwarding
branch instructions for example. Additionally, Figure 8 represents the passing of data from the I Unit 106 to the A Unit
110 and the D Unit 112 at 126 and 128 respectively.

[0045] Various aspects of the processor are summarized in Table 1.

Table 1: Processor Summary

Very Low Power programmable processor
Parallel execution of instructions, 8-bit to 48-bit instruction format
Seven stage pipeline (including pre-fetch)

Instruction buffer unit highlight
    32x16 buffer size
    Parallel instruction dispatching
    Local Loop

Data computation unit highlight
    Four 40-bit generic (accumulator) registers
    Single cycle 17x17 Multiplication-Accumulation (MAC)
    40-bit ALU, "32 + 8" or "(2 x 16) + 8"
    Special processing hardware for Viterbi functions
    Barrel shifter

Program flow unit highlight
    32-bits/cycle program fetch bandwidth
    24-bit program address
    Hardware loop controllers (zero overhead loops)
    Interruptible repeat loop function
    Bit field test for conditional jump
    Reduced overhead for program flow control

Data flow unit highlight
    Three address generators, with various addressing modes
    Three 7-bit main data page registers
    Two Index registers
    Eight 16-bit pointers
    Dedicated 16-bit coefficients pointer
    Four 16-bit generic registers
    Three independent circular buffers
    Pointers & registers swap
    16-bits ALU with shift

Memory Interface highlight
    Three 16-bit operands per cycle
    32-bit program fetch per cycle
    Easy interface with cache memories

C compiler
Algebraic assembler

[0046] The microprocessor is configured to respond to a local repeat instruction, which provides for an iterative
looping through a block of instructions. The local repeat instruction is a 16-bit instruction and comprises: an op-code;
a parallel enable bit; and an offset (6 bits). The op-code defines the instruction as a local instruction, and prompts the
microprocessor to expect the offset and op-code extension. In the described embodiment the offset has a maximum
value of 55. However, this does not mean that the loop size is limited to 55 bytes. Indeed, this offset indicates the
difference between the block repeat end address and the start address, with the start address being the address of the
first instruction or pair of instructions and the end address being the address of the last instruction or last instruction
of a pair of instructions. Therefore, the maximum loop size can be (55 + "size of last instruction"), which is less than
or equal to 61 bytes. In other embodiments, the offset and loop size may be either larger or smaller, in accordance
with a different size instruction buffer queue, for example.
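By way of illustration only, the loop size arithmetic of paragraph [0046] can be checked with the following sketch. The 6-byte maximum instruction size follows from the 48-bit instruction format and the 64-byte queue capacity from the 32 x 16-bit IBQ, both described above; the function and constant names are hypothetical.

```python
MAX_OFFSET_BYTES = 55        # maximum offset in the described embodiment
MAX_INSTRUCTION_BYTES = 6    # largest instruction format is 48 bits
IBQ_BYTES = 32 * 16 // 8     # 32 x 16-bit instruction buffer queue


def end_address(start_address, offset):
    """Address of the last instruction: start address plus the encoded offset."""
    return start_address + offset


def loop_size(offset, last_instruction_bytes):
    """Loop size in bytes: the offset plus the size of the last instruction."""
    if not 0 <= offset <= MAX_OFFSET_BYTES:
        raise ValueError("offset out of range for this embodiment")
    if not 1 <= last_instruction_bytes <= MAX_INSTRUCTION_BYTES:
        raise ValueError("instruction size out of range")
    return offset + last_instruction_bytes


largest_loop = loop_size(MAX_OFFSET_BYTES, MAX_INSTRUCTION_BYTES)
```

The largest loop, 55 + 6 = 61 bytes, still fits within the 64-byte queue, which is consistent with the loop being executed directly from the instruction buffer.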

[0047] Referring again to Figure 5, when the local loop instruction is decoded, the start address for the local loop,
i.e. the address after the local loop instruction address, is stored in the Block Repeat Start Address0 (RSA0) register,
which is located, for example, in the P Unit 108. After the initial pass through the loop, the Read Program Counter
(RPC) is loaded with the contents of RSA0 for reentering the loop. The location of the last instruction of the local loop
is computed using the offset, and the location is stored in the Block Repeat End Address0 (REA0) register, which may
also be located in the P Unit 108, for example. Two repeat start address registers and two repeat end address registers
(RSA0 550, RSA1 551, REA0, REA1) are provided for nested loops. For nesting levels greater than two, preceding
start/end addresses are pushed to a stack register. In addition to these four registers, the block repeat control circuitry
also includes two Block Repeat Count (BRC0/BRC1) registers and associated control circuitry.
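By way of illustration only, the interplay of the start address, end address and count registers for a single (non-nested) loop can be sketched as follows. The class `BlockRepeat`, the helper `run`, the example addresses and the convention that the count register holds the number of remaining repetitions are all assumptions made for the sketch, not details taken from the described circuitry.

```python
class BlockRepeat:
    """Minimal model of one set of block repeat registers (RSA0, REA0, BRC0)."""

    def __init__(self, rsa, rea, brc):
        self.rsa = rsa   # address of the first instruction of the loop
        self.rea = rea   # address of the last instruction of the loop
        self.brc = brc   # remaining repetitions (assumed convention)

    def next_pc(self, pc, size):
        """Address of the instruction that follows the one at pc."""
        if pc == self.rea and self.brc > 0:
            self.brc -= 1
            return self.rsa          # reload the read program counter from RSA0
        return pc + size             # ordinary sequential flow


def run(program, start_pc, repeat):
    """Execute a {address: size_in_bytes} program and return the address trace."""
    trace = []
    pc = start_pc
    while pc in program:
        trace.append(pc)
        pc = repeat.next_pc(pc, program[pc])
    return trace


# Address 0 holds the 2-byte local repeat instruction, so the loop starts at 2.
program = {0: 2, 2: 2, 4: 3, 7: 1, 8: 2}
offset = 5                                   # end address minus start address
repeat = BlockRepeat(rsa=2, rea=2 + offset, brc=2)
trace = run(program, 0, repeat)
```

Here the loop body at addresses 2, 4 and 7 is executed three times (the initial pass plus two repetitions) before execution falls through to address 8.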

[0048] Typically, DSP program code results in a significant amount of processor execution cycles resulting from
intensive repetition of loops. In the present embodiment, most of these loops can be managed as a 'local repeat' where
the code is directly executed from the instruction buffer and fetch from external memory is disabled. This will be
described in more detail with reference to Figure 18. Since those local repeat loops involve a limited number of instructions
driven by the nature of the algorithm, there is an opportunity to selectively disable an entire functional unit or one or
more partitions of a functional unit or control circuitry in order to minimize power consumption. This can be done by
profiling the block repeat body of instructions during the compile/assembly process or during the first iteration of the
loop by monitoring circuitry within the microprocessor.

[0049] The microprocessor of the present embodiment of the invention has both a local repeat instruction and a 
general block repeat instruction for blocks which cannot fit entirely within the IBQ. Repeat loop profiling is associated 
with the local repeat instruction, since a large repeat block is less likely to use a limited set of hardware resources. 
However, aspects of the present invention are also useful in an embodiment which does not include a local repeat per 
se, but has just a general block repeat instruction, for example. In such an embodiment, a check can be done to
determine block length and invoke repeat profiling only for short blocks, for example.

[0050] When the assembler performs profiling, a repeat profile parameter is formed based on the analysis of the 
instructions within the block and is attached to the local block repeat instruction as an immediate operand. Typically 
one extra byte is enough to specify the selected partitions which can be disabled. 

[0051] When the monitoring hardware perfonns profiting, it is determined from the execution of the first iteration of 
a block of instructions the hardware resources required for executing that block of instructions. Then from the second 
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to last iteration useless hardware and associated control decode logic can be disabled, or inhibited. 
[0052] Figure 9 is an illustration of grouping within an instruction set of the processor. An aspect of the present
invention is that the instruction decoder can be partitioned into a number of partitions based on instruction groups. For
example, Figure 9 illustrates an instruction set 900 with five instruction groups, 901-905. Depending on the
instructions used within a repeat block, one or more of the instruction groups may not be represented. For example,
during a first repeat loop, the block of instructions consists of instructions within only groups 902 and 904. Instructions
within groups 901, 903 and 905 are not used. Decode logic associated with these unused instruction groups, or with
addressing modes which do not need to be decoded, can therefore be disabled during the iterative execution
of this first block of instructions. A subsequent repeat loop may have a block of instructions in which different instruction
groups are not represented. Different decode logic associated with these different unused instruction groups can
therefore be disabled during the iterative execution of the subsequent block of instructions. This scheme allows a
large DSP instruction set to be traded off for encoding flexibility and code size optimization while keeping the dynamic
instruction set seen by the decode hardware to a minimum.
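
By way of illustration only, the selection of decode partitions that may be disabled for a given repeat block can be sketched as a set difference. The group numbers are those of Figure 9; the function name unused_groups and the representation of a block as a list of group numbers are assumptions made for this sketch and are not part of the described embodiment.

```python
# Instruction groups 901-905 as labeled in Figure 9.
ALL_GROUPS = frozenset({901, 902, 903, 904, 905})


def unused_groups(block_groups):
    """Return the groups whose decode partitions could be disabled.

    block_groups: the group number of each instruction in the repeat block
    (a hypothetical input; how an instruction maps to a group is not modeled).
    """
    used = frozenset(block_groups)
    unknown = used - ALL_GROUPS
    if unknown:
        raise ValueError(f"unknown instruction groups: {sorted(unknown)}")
    return ALL_GROUPS - used


# First repeat loop from the text: only groups 902 and 904 appear.
print(sorted(unused_groups([902, 904, 902])))  # [901, 903, 905]
```

A subsequent loop using a different mix of groups simply yields a different set, which corresponds to the different decode logic being disabled for that loop.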

[0053] Within the processor of the present embodiment of the invention, a set of control flow instructions is defined
which are not allowed to be used within a repeat loop, including: goto, call, return, switch, intr (software interrupt), trap,
reset, and idle. The control flow instructions that are inherently illegal in a local repeat do not need to be decoded during
execution of the loop. Therefore, by partitioning the instruction decode hardware to place this set of instructions in a
separate partition, a significant number of gates can be frozen during local repeat loop execution regardless of the
block repeat profile parameter.

[0054] For example, an alternate embodiment of the present invention does not provide support for a repeat profile 
parameter; however, power consumption is reduced by inhibiting operation of a partition of the instruction decoder 
corresponding to the inherently forbidden group of instructions during the step of repetitively executing the block of 
instructions while a remainder of the instruction decoder decodes the block of instructions as they are executed in the 
pipeline. 

[0055] Table 2 illustrates instruction encoding for the repeat profile parameter that is appended to the local repeat
instruction of the processor of the present embodiment.

Table 2: Encoding for local repeat instruction with repeat profile parameter

Without profiling   Localrepeat(16)   0000 000E 0011 1111            16: block length in bytes
With profiling      Localrepeat(16)   0000 000E 0011 1111 pppppppp   Same algebraic syntax. The profile is not
                                                                     determined by the user but by the assembler.

As noted in Table 2, a person writing a program is not responsible for the repeat profile parameter. This parameter is
determined by the assembler in the present embodiment, or by monitoring hardware in an alternative embodiment,
without assistance or direction from the programmer.

[0056] One skilled in the art will recognize that other encodings can be used for a repeat instruction. In this embodiment,
the repeat profile parameter is appended to the repeat local instruction. However, one skilled in the art will
recognize that a repeat profile parameter may be appended to or associated with any instruction that acts as a prologue 
instruction for a repeat loop. For example, in another embodiment the repeat profile parameter is passed by a load 
instruction which is inserted in the machine-readable instruction stream by the assembler for execution prior to exe- 
cution of the associated repeat loop. 

[0057] Figure 10 is a block diagram illustrating the instruction execution pipeline of processor 100 in more detail,
including partitions of the instruction decoder. The instruction decoder of the present embodiment is hierarchical. A
first level of instruction decoding is associated with the DECODE pipeline stage and is represented by partitions 802a-e
and 512. A second level of instruction decoding is associated with the ADDRESS pipeline stage and is represented
by instruction decoding hardware 808 having partitions 810a-c, 812a-c, 820a-c and 822a-c. Each instruction decoder
partition is associated with an instruction group. The instruction groups illustrated in Figure 9 and the partitions illustrated 
in Figure 10 are simplified for illustrative purposes. Various embodiments of the invention may have more or fewer 
instruction groups and decoder partitions than herein illustrated. 

[0058] As discussed earlier, an instruction pair is received into instruction register 522 and then decoded. The
instruction format extracted by decoder 512 in the DECODE pipeline stage defines an instruction #1 / instruction #2
boundary and controls mux 521. Instruction #1 and instruction #2 are then isolated by being loaded into separate
instruction registers 805, 806 according to their respective formats in the ADDRESS stage. In the DECODE stage, control
flow instructions are decoded by partition 802a, repeat instructions are decoded by partition 802b, soft dual instructions
are decoded by partition 802c, address modes are decoded by partition 802d and stack pointer control instructions
are decoded by partition 802e.

[0059] During the ADDRESS pipeline stage, the second level instruction decoder 808 determines which Data Unit
resources are required to process the instruction pair. Data units 38, 40 and 42 are presented for illustrative purposes.
Various embodiments may have additional or fewer Data Units. Data Units that are determined to be unneeded for
execution of the current instruction pair are kept frozen in order to reduce power consumption. This is done by maintaining
[...] gating clock control hardware 831, 832 and/or 833 so that signal transitions do not occur within the unneeded units.
Local decode hardware associated with the unneeded unit in the READ pipeline stage is also kept frozen.

[0060] Advantageously, the local repeat profiling scheme allows anticipation of the data resources that are unneeded
for a given loop execution and avoids having to decode, at each step, whether the current instruction opcode is within
the group of instructions involving such a unit. A repeat profile parameter provided as an immediate operand of a repeat
local instruction is stored in a repeat profile register 800. Identified partitions within instruction decoder 808 are
therefore inhibited in response to the repeat profile parameter during repetitive execution of an associated loop.

[0061] The repeat profiling scheme also allows freezing selected partitions of decode hardware
in the DECODE pipeline stage. However, as discussed above, all the control flow instructions (goto, call, ...) that are
illegal within a local repeat body do not need to be decoded. Therefore, associated hardware in partition 802a can be
frozen during the entire loop execution regardless of whether or not a repeat profile parameter is provided.

[0062] The profile can determine if the loop body includes nested local repeat or single repeat instructions. When
there is no nesting, hardware partition 802b associated with "local repeat & repeat" decode can be frozen during the
entire loop execution.

[0063] The profile can determine if the loop body includes stack pointer related instructions (push(), pop(), ...). When
there are no stack pointer related instructions, hardware partition 802e associated with "push() / pop() family" decode
can be frozen during the entire loop execution.

[0064] The profile can determine if the loop body includes soft dual or built-in dual instructions. The instruction
extraction hardware and the Address Generator control can take advantage of this static configuration to reduce gate
activity in hardware partition 802c.

[0065] Figure 11 is a block diagram illustrating block repeat control circuitry 1100 of the processor in more detail,
including repeat profile register 800 and a profile mask 1101. Repeat profile register 800 is loaded with a repeat profile
parameter provided by a repeat local instruction. In case of loop nesting, two options are possible. In a first option, the
profile is determined according to the resources needed by both the outer and the inner loops. In a second option, the
outer loop and the inner loop have their own profiles and register 800 includes two registers that can be separately
selected by a mux included in mask 1101, or by other means. The profiles are then managed as a stack by finite state
machine (FSM) 1104. The profile is switched according to the active level of block repeat. This scheme provides
better granularity but requires some extra hardware.

[0066] The profile is masked by mask 1101 in response to FSM 1104 as soon as the local repeat of a block of 
instructions is completed, or in case the loop execution is interrupted. Upon return from interrupt service routine (ISR), 
the profile is unmasked and becomes active. This allows the full instruction set to be active during the ISR. 
[0067] Still referring to Figure 11, profile signals 1110a-n from mask circuitry 1101 are provided to various hardware
partitions of the microprocessor in order to inhibit operation of selected partitions. For example, profile signal 1110a is
provided to instruction decoder partition 1102c. Likewise, other profile signals from mask 1101 are provided to other
partitions in the DECODE stage. Certain partitions, such as 1102b, need to remain enabled at all times and do not
respond to profile signals. As discussed above, certain partitions, such as control flow partition 802a, are disabled
regardless of the profile parameter whenever a local loop is executed, as indicated by decode signal 1111 from decode
partition 802b. Inhibit signal 1112 from FSM 1104 is asserted in response to decode signal 1111.
[0068] The decode hardware partitioning matches the granularity defined by the profile parameter bits. Inhibiting, or
freezing, is handled by an extra signal input for the respective profile signal on a 1st stage of decode logic (extra gate
or extra input). This freeze control can be seen as a static signal for the duration of the loop execution. This avoids
propagation through the logic of useless transitions or glitches. The profile information may be used in other
embodiments to freeze D-flipflops (DFFs) or latches by clock control, where in a conventional design this may generate a
speed path for the gated clock enable signal.

[0069] Still referring to Figure 11, pipe delay register 1120 maintains timing of the profile signals for ADDRESS stage
partitions of decoder 808. A freeze performed in the ADDRESS stage will be propagated to READ stage decoder partition
1130 without extra control. Also, various profile signals may be combined by logic gates, such as gate 1140, to create
combinations and permutations of the profile signals to inhibit various hardware partitions.

[0070] Figure 12 is a block diagram of another embodiment of the present invention illustrating the block repeat
control circuitry of the processor in more detail, including instruction register 522 for variable size instruction words.
Repeat profile register 1200 is loaded with a repeat profile parameter provided by a repeat local instruction. The profile
is unmasked by mask 1201 in response to FSM 1204 as soon as execution of the local repeat of a block of instructions
is started. The profile is masked by mask 1201 in response to FSM 1204 as soon as the local repeat of a block of
instructions is completed, or in case the loop execution is interrupted.

[0071] A profile parameter stored in profile register 1200 can identify the maximum length instruction format in the
loop body. The instruction register is partitioned into several partitions 1230-1235. The hardware can then selectively
inhibit unneeded instruction register partitions to adjust instruction register size accordingly. A clock signal IRLOAD
loads instruction register 522 with a new instruction selected from the instruction buffer (see Figure 5) by mux 520.
Gates 1220-1223 each receive a profile signal from parameter register 1200 via mask 1201 that is combined with clock
signal IRLOAD to inhibit loading of selected partitions of instruction register 522 during repetitive execution of a block
of instructions. For example, if the maximum length of all instructions executed in a given block repeat is determined
to be five bytes, then partition 1230 is inhibited by forming a repeat profile parameter which causes profile signal 1210a
to be asserted low during execution of the given block such that the clock signal IRLOAD is inhibited from passing through
AND gate 1220, thereby inhibiting clocking of instruction register partition 1230. Likewise, if the maximum instruction
format is determined to be four bytes, then profile signal 1210a and profile signal 1210b are both asserted low during
repetitive execution of the associated block of instructions to inhibit clocking of partitions 1230 and 1231. Since a block
will always have at least a two byte instruction, partitions 1234 and 1235 do not have inhibiting circuitry associated
with them. One skilled in the art will realize that means other than an AND gate can be used to inhibit selected partitions
in response to the repeat profile parameter.
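
The gating just described can be modeled, purely as a sketch, by the following code. It assumes that each of the six partitions 1230-1235 holds one byte of the instruction register, which is inferred from the five-byte and four-byte examples above rather than stated; the function names are hypothetical.

```python
# Partition numbers follow Figure 12. Assumption: one byte per partition,
# with 1230 holding the byte used only by the longest (six-byte) format.
IR_PARTITIONS = [1230, 1231, 1232, 1233, 1234, 1235]
GATED = IR_PARTITIONS[:4]  # 1234 and 1235 have no inhibiting circuitry


def inhibited_partitions(max_len_bytes):
    """Partitions whose IRLOAD clock is blocked for a given maximum format."""
    if not 2 <= max_len_bytes <= len(IR_PARTITIONS):
        raise ValueError("maximum instruction length must be 2..6 bytes")
    unused = len(IR_PARTITIONS) - max_len_bytes
    return GATED[:unused]


def gated_clocks(max_len_bytes, irload=1):
    """AND the IRLOAD clock with each active-low profile signal."""
    blocked = set(inhibited_partitions(max_len_bytes))
    return {p: irload & (0 if p in blocked else 1) for p in IR_PARTITIONS}


print(inhibited_partitions(5))  # [1230]
print(inhibited_partitions(4))  # [1230, 1231]
```

With a two-byte maximum format all four gated partitions are held, while partitions 1234 and 1235 continue to receive IRLOAD.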

[0072] Still referring to Figure 12, in a similar manner, mux 520 can be partitioned and selected partitions inhibited
in response to a maximum length instruction format indicated by the repeat profile parameter.

[0073] As noted in Table 2, an advantage of the present invention is that a person writing a program is not responsible 
for the repeat profile parameter. This parameter is determined by the assembler in the present embodiment, or by
monitoring hardware in an alternative embodiment, without assistance or direction from the programmer. The
embodiments of Figure 11 and Figure 12 may be combined so that a single repeat profile parameter inhibits selected partitions
of an instruction decoder and also selected partitions of an instruction register by appropriate selection and connection 
of profile signals from the profile parameter register, as indicated at 1250. 
[0074] Figure 13 is a timing diagram illustrating operation of processor 100 with two repeat loops in a nested loop.
In case of nesting of loops, two options are possible: (1) a single composite profile is determined according to the
resources needed by both the outer and inner loops, or (2) the outer loop and inner loop each have their own profile, in
which case the profiles are managed as a stack and the profile is switched according to the active level of block repeat.
The second option provides better granularity but requires some extra hardware. Referring again to Figure 11, two
profile registers, PROFILE0 and PROFILE1, are included within 800 of the present embodiment. Mask 1101 includes
MASK0 and MASK1. Muxing circuitry (not shown) within 1101 operates in response to FSM 1104 to provide the selected
profile parameter on profile signals 1110a-n when one of them is unmasked.
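
The two nesting options can be contrasted with a short sketch. It assumes each profile is a bit mask in which a set bit marks a partition that may be disabled, consistent with the parameter described as specifying the partitions which can be disabled; the bit assignments and the class and method names are hypothetical.

```python
def composite_profile(outer_disable, inner_disable):
    """Option 1: a partition may be disabled only if neither loop needs it."""
    return outer_disable & inner_disable


class ProfileStack:
    """Option 2: PROFILE0 (outer) and PROFILE1 (inner), selected by level."""

    def __init__(self, profile0, profile1):
        self.profiles = (profile0, profile1)
        self.level = None  # None means no loop is active: profile masked

    def enter_outer(self):
        self.level = 0

    def enter_inner(self):
        self.level = 1

    def exit_inner(self):
        self.level = 0

    def exit_outer(self):
        self.level = None

    def active_profile(self):
        """Mask driven onto the profile signals at the current level."""
        return 0 if self.level is None else self.profiles[self.level]


outer, inner = 0b0110, 0b1100
print(bin(composite_profile(outer, inner)))  # 0b100
```

In this example the composite option disables one partition throughout, whereas the stacked option disables two partitions at each level, which illustrates the finer granularity obtained at the cost of the second register and the selection logic.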
[0075] Referring now to Figure 13, timeline 1300 illustrates operation of profile signals 1110a-n (on Figure 11) or
1210a-n (on Figure 12) during a nested loop, using the first option. A composite repeat profile representative of both
an inner and an outer loop is determined and stored in the profile register by a prologue instruction associated with
the outer loop. As discussed earlier, the prologue instruction may be the loop instruction which is decoded during time
slot 1310, or it may be a store instruction, for example. If the repeat instruction for the inner loop provides a profile
parameter, it is ignored. The profile remains masked until time 1311, when the initial instruction of the block begins
execution. The composite profile remains unmasked during the entire time 1304 of execution of the nested loops.
During time 1312, the last iteration of the outer loop is performed. At time 1313, the final instruction of the last iteration
is executed and the profile is again masked, as indicated at time slot 1306.

[0076] Timeline 1300 is also representative of the operation of a single block repeat in which case time slot 1304 
represents iterative execution of the block of instructions and time slot 1312 represents the last iteration of the block 
of instructions. 

[0077] Still referring to Figure 13, timeline 1340 illustrates operation of profile signals 1110a-n (on Figure 11) or
1210a-n (on Figure 12) during a nested loop, using the second option. A first repeat profile representative of an outer
loop is determined and stored in profile register PROFILE0 by a prologue instruction associated with the outer loop. A
second repeat profile representative of an inner loop is determined and stored in profile register PROFILE1 by a prologue
instruction associated with the inner loop. As discussed earlier, the prologue instructions may be the inner and outer
loop instructions which are decoded during time slots 1350 and 1314, or they may be store instructions, for example.
The profile remains masked until time 1311, when the initial instruction of the outer loop begins execution. The first
profile remains unmasked during execution of the outer loop, illustrated by time slots 1343 and 1345. At time 1315,
the initial instruction of the inner loop begins execution and the second profile is selected by FSM 1104 during execution
of the inner loop, illustrated by time slot 1344. As execution moves from inner loop to outer loop, and vice versa, the
corresponding profile is selected by FSM 1104. Timeline 1340 illustrates only a single iteration of the inner loop for
clarity, but one skilled in the art realizes multiple iterations of the inner and outer loops typically occur. During time
[...]. During time 1352, the last iteration of the outer loop is performed.
At time [...], the final instruction of the last iteration is executed and the profile is again masked, as indicated at time
[...].

[0078] Figure 14 is a flow chart illustrating various steps involved in repetitively executing a block of instructions in
[...]. In step 1400, sequential execution of an instruction sequence is performed. It is to be
understood that the term "sequential" may include jumps, branches, calls, returns, etc. During

step 1402, block repeat control circuitry is initialized by prologue instructions associated with a pending loop. This
includes, for example, loading a block repeat count register. This may also include loading a repeat profile register.
During steps 1404 and 1406, sequential execution is performed until a repeat instruction is decoded. In a preferred
embodiment, the repeat instruction provides a repeat profile parameter that is determined for the associated block of
instructions that are to be repetitively executed. At step 1408, a partition of the instruction decoder corresponding to
a group of instructions that are inherently prohibited during repetitive block execution is inhibited.

[0079] If a repeat profile parameter is not received, then monitoring circuitry monitors execution of a first iteration
of the block of instructions during step 1412 and determines which partitions of the processor are not needed for the
remaining iterations. In either case, unneeded partitions of the instruction decoder are inhibited at step 1414 along
with any other hardware partitions that have been determined to be unneeded for execution of the block of instructions.
The block of instructions is executed by repetitively looping through steps 1416, 1418, and 1422. If an interrupt is
detected in step 1418, then the profile is masked during execution of the ISR so that the ISR can be executed without
[...]. Upon return from the ISR, the profile is again unmasked and is active in inhibiting unused partitions.

[0080] Each complete iteration of the block of instructions is checked at step 1424. After the last iteration is completed,
the profile is masked and sequential execution is resumed at step 1426 without inhibited circuitry partitions.
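
The masking behavior of the flow of Figure 14 can be sketched as a trace of the inhibit mask over time. The bit chosen for the always-inhibited control flow partition, the function name and the event labels are assumptions for illustration; lifting every inhibit during the ISR follows the statement that the full instruction set is active during the ISR.

```python
CONTROL_FLOW_PARTITION = 0b1000_0000  # hypothetical bit assignment


def run_block_repeat(profile, iterations, interrupted_iterations=()):
    """Return a list of (event, inhibit_mask) pairs for one block repeat."""
    # Steps 1408 and 1414: inherently unused partition plus profiled ones.
    loop_mask = profile | CONTROL_FLOW_PARTITION
    interrupted = set(interrupted_iterations)
    trace = []
    for i in range(iterations):  # steps 1416 to 1424
        trace.append((f"iteration {i}", loop_mask))
        if i in interrupted:  # step 1418: interrupt detected
            trace.append(("ISR", 0))  # profile masked during the ISR
            trace.append(("return from ISR", loop_mask))
    trace.append(("sequential execution", 0))  # step 1426
    return trace


for event, mask in run_block_repeat(0b101, 2, interrupted_iterations=[0]):
    print(f"{event:22s} {mask:08b}")
```

The trace shows the mask dropping to zero for the ISR and again when sequential execution resumes, so that no partition remains inhibited outside the loop.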
[0081] Figure 15 is a block diagram illustrating monitoring circuitry for determining a profile during execution of a
block of instructions by processor 100. Circuitry 1500 represents a partition of a portion of the hardware of processor
100, which in the present embodiment is a partition of an instruction decoder, but in another embodiment may represent
a partition of other portions of the processor, such as an instruction register, for example. Partition 1500 receives signals
from instruction register 1502 and provides one or more output signals 1510 representative of activity by partition
1500. Monitoring circuitry 1520 monitors signal(s) 1510 during a first iteration of the block of instructions. If partition
1500 is active during one or more of the instructions included within the block of instructions, then monitoring circuitry
1520 is set accordingly. At the end of the iteration, profile register 1530 is set according to monitoring circuitry 1520.
During remaining iterations of the block of instructions, AND gate 1532 inhibits propagation of signals through partition
1500, thereby reducing power consumption, in response to profile signal 1531 if partition 1500 was not used during
the first iteration of the block of instructions. One skilled in the art will recognize that AND gate 1532 is merely
representative of circuitry for inhibiting partition 1500. Various embodiments of inhibiting circuitry are readily derived by one
skilled in the art to embody aspects of the present invention.

[0082] Figure 16 is a timing diagram illustrating operation of the monitoring circuitry of Figure 15 during execution of
a block of instructions by the processor. During time slot 1600, the first iteration of the block of instructions is performed.
Figure 16 illustrates operation of three hardware partitions, for simplicity: unit_x at 1620, unit_y at 1621, and unit_z at
1622. Shaded areas of 1620 and 1621 indicate that unit_x and unit_y are used by one or more of the instructions in
the block of instructions during the first iteration. However, no shading in 1622 indicates that unit_z was not used during
the first iteration. Therefore, setting of the monitoring circuitry at the end of the first iteration determines that unit_x and
unit_y are needed, but unit_z is not needed. At time 1612, the repeat profile register is set with a profile parameter in
response to the monitoring circuitry. During the remaining iterations of the block of instructions indicated by time slot
1602, unit_z is inhibited in response to the profile parameter to reduce power consumption.
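
The behavior of the monitoring circuitry can be sketched in software as follows. The unit names are those of Figure 16; representing the first iteration as a list of per-instruction activity sets, and the function name, are assumptions made for this sketch.

```python
def profile_from_first_iteration(units, first_iteration_activity):
    """Return the units that can be inhibited for the remaining iterations.

    units: names of the monitored hardware partitions.
    first_iteration_activity: for each instruction of the first iteration,
    the set of units it activated (hypothetical observation format).
    """
    used = set()
    for active_units in first_iteration_activity:
        used |= set(active_units)  # monitoring circuitry latches activity
    return [unit for unit in units if unit not in used]


units = ["unit_x", "unit_y", "unit_z"]
first_iteration = [{"unit_x"}, {"unit_x", "unit_y"}, set()]
print(profile_from_first_iteration(units, first_iteration))  # ['unit_z']
```

Because activity is only accumulated, a unit used by even one instruction of the first iteration is never inhibited.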

[0083] Referring again to Figures 4 and 10, there are several other portions of processor 100 that can be partitioned
and selectively inhibited during repetitive execution of a block of instructions in order to further reduce power
consumption. For example, in one embodiment, the profile can indicate if the loop body includes instructions performing an
initialization in the ADDRESS pipeline slot or a swap in the register file. If it does not, the associated hardware (not
shown) can be frozen during the entire loop execution.

[0084] In another embodiment, the profile can indicate if the loop body includes instructions involving the data coef- 
ficient pointer. The associated hardware (not shown) can be frozen during the entire loop execution. 
[0085] In another embodiment, if the loop requires only two address generators out of the three included in address 
unit 110, then the unneeded address generator can be inhibited.

[0086] In another embodiment, if the algorithm doesn't care about status update then the status update circuitry (not 
shown) can be inhibited during execution of the block repeat. 

[0087] In another embodiment, if it is determined that no instruction parallelism can be taken advantage of during 
execution of the block of instructions, then instruction register 806 and all associated control circuitry can be inhibited. 
Similarly, in a VLIW architecture where up to six to eight instructions can be dispatched per cycle, for example, it is not
always possible to fully take advantage of such parallelism during repetitive execution of a block of instructions. A local
repeat profile can advantageously provide a means to adjust the hardware according to the execution needs. For
instance, if within the loop the maximum number of parallel instructions is four, then the profile can pass this information
before loop execution in order to freeze unneeded hardware.

[0088] The same approach can be applied to data format. The processor supports different data types: 8-bit, 16-bit,
32-bit, and dual 16-bit. Other embodiments may support floating point, for example. The datapath is partitioned into
slices and only the datapath partitions required by the block of instructions are allowed to be active during repetitive
execution of the block of instructions.

[0089] Figure 17 is a flow chart illustrating various steps involved in forming a repeat profile parameter by an
assembler by determining what partitions will be needed during execution of a block of instructions. In step 1700, initial
assembly tasks are performed. As used herein, the term "assembler" means any means for converting human readable
programs into machine readable instruction sequences, including compiling and incremental compilation, for example.
Assembler operation in general is known and will not be described further herein. In step 1702, a table is created which
has an entry for each machine readable instruction executable format. Each entry includes a pattern that indicates 
which selectable hardware partitions are required for execution of the associated instruction. For example, the pattern 
may indicate a particular instruction group that corresponds to a partition in the instruction decoder. The pattern may 
indicate instruction length, address mode, etc, depending on the selected processor and the hardware partitioning 
supported by that processor. 

[0090] In step 1704, the source code is transformed into a sequence of machine readable instructions using known
compilation/assembly techniques. In steps 1706 and 1708, each machine readable instruction is examined to determine
if it is a repeat instruction. Once a repeat instruction is located, then in step 1710 an initial instruction for the block of
instructions associated with the repeat instruction is identified and a group pattern for the initial instruction is accessed
and used as an initial profile parameter. In steps 1712 and 1714, each subsequent instruction of the block is examined
and a group pattern associated with each is combined with the initial repeat profile parameter. Once the final instruction
of the block of instructions is examined and its group pattern included in the profile parameter, the profile parameter
is associated with a prologue instruction associated with the block of instructions. In a preferred embodiment, the profile
parameter is appended to the repeat instruction as shown in Table 2.
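
The accumulation of group patterns in steps 1710 to 1714 can be sketched as follows. The mnemonics, the pattern table and the bit assignments are hypothetical; combining patterns by bitwise OR, and emitting the complement as the mask of partitions which can be disabled, are assumptions of this sketch rather than details given in the description.

```python
# Hypothetical pattern table (step 1702): mnemonic -> required partitions.
PATTERNS = {
    "mac": 0b00001,
    "add": 0b00010,
    "push": 0b00100,
    "pop": 0b00100,
    "mov": 0b01000,
}
ALL_PARTITIONS = 0b11111


def required_partitions(block):
    """OR together the group patterns of every instruction in the block."""
    if not block:
        raise ValueError("a repeat block needs at least one instruction")
    mask = PATTERNS[block[0]]        # step 1710: initial profile parameter
    for mnemonic in block[1:]:       # steps 1712 and 1714
        mask |= PATTERNS[mnemonic]
    return mask


def repeat_profile(block):
    """Partitions that may be disabled while the block repeats."""
    return ALL_PARTITIONS & ~required_partitions(block)


print(f"{repeat_profile(['mac', 'add', 'mac']):05b}")  # 11100
```

With five selectable partitions the resulting parameter fits within the single extra byte mentioned for the profile operand.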

[0091] In step 1718, the process continues and additional blocks of instructions and associated profile parameters 
are formed until the sequence of machine readable instructions is completely processed. In step 1720, the assembly
process is completed, using known assembly techniques. The completed assembly process provides a sequence of
machine readable instructions in which each repeatable block of instructions has a prologue instruction, such as a
repeat instruction, with an appended repeat profile parameter.

[0092] Referring now to Figure 18 and with reference to Figure 5, the local loop instruction flow for the preferred
embodiment will be described in more detail. The local loop repeat is set up by initializing a Block Repeat Count
(BRC0/BRC1), shown in the DECODE stage in a first pipeline slot 602, with the number of iterations of the local loop,
and then in the next slot 604 the local loop instruction (RPTL) itself is decoded. The BRC0/BRC1 is decremented for
each repeat of the last instruction of the loop if BRC0 (or respectively BRC1) is not zero. It will be evident to a skilled
person that optionally the local loop repeat may be set up by defining a maximum iteration value, and initializing a 
counter to zero. The counter can then be incremented for each repeat of the last instruction of the loop. The decrement 
or increment may be in steps other than one. During slots 602 and 604, the Program Counter increases by four bytes 
to a value "PC", and two further instruction words are fetched into the IBQ 502, thus two instruction words per slot 602, 
604 are fetched into IBQ 502. In slot 602 the number of words 504 available in the IBQ 502 is 2, and is shown labeled 
Count in Figure 18. The number of words available in the IBQ 502 is given by the difference between the LRPC 536 
and the LWPC 532, since they respectively point to the currently dispatched instruction and the location for writing the 
next instruction into the IBQ 502. Since, for the purposes of this embodiment, the instruction which initializes the 
BRC0/BRC1 is a one word 16-bit instruction, for example, and BRC0/BRC1=DAx comprises no parallelism, only the
16-bit initialization instruction is dispatched to the first or second instruction decoder 512, 514 in slot 602. 
[0093] For the next slot 604, the WPC increases by four to a value "PC" and a further 2 x 16-bit instruction words
504 are fetched into the IBQ 502. The number of instruction words 504 available in the IBQ 502 is now 3, since only the
1 word instruction initializing BRC0/BRC1 was dispatched during the previous slot 602.
[0094] The first iteration of the local loop begins at slot 606, where a first parallel pair of instructions L0, L1 are
dispatched to the decoders 512, 514. The number of instruction words 504 which are available in the IBQ 502 is now 
4. This is because in the present embodiment the local loop instruction is only a 16-bit instruction and therefore only 
one word 504 was dispatched to the decoder 512 during the previous slot 604. 

[0095] In order to optimize the execution of the local loop, the instructions are executed in parallel so far as is possible. 
In the present example, it is assumed that all instructions comprising the body of the loop are executable in parallel. 
This results in two unused slots, 610, 612 during the first pass of the body of the loop, but leads to greater speed for 
the rest of the iterations. 

[0096] Additionally, for the present example, instructions L0, L1 are executable in parallel and comprise a total of 48
bits; thus 3 instruction words 504 are dispatched to the decoders 512, 514 for each decode stage. For the start of the
[...] dispatched to the decoders and the difference between the LRPC
[...] instruction words are fetched into the IBQ, but three words are
dispatched. [...] the LWPC moves two words along the IBQ
502 to the next fetch location. Thus, the difference between LWPC 532 and LRPC 536 is decreased by one to three
[...].

[0097] [...] instructions L2, L3 are executable in parallel and comprise a total
of 48 bits, the LRPC 536 moves 3 words along the IBQ 502 ready for the next slot 610. The program pre-fetch is halted
[...] no instruction words are loaded into the IBQ 502 for this slot. Thus, for
slot 610 the LRPC 536 and LWPC 532 point to the same IBQ 502 address, and Count = 0. Since there are no available
[...] are fetched
into the IBQ 502 during slot 610, moving LWPC 532 along the IBQ by two words, and therefore there are two instruction
words available for slot 612. However, if the next two instructions, L4, L5, are parallel instructions comprising 48 bits,
then there is no dispatch in slot 612, and there is a further unused slot.
[0098] For slot 614 there are a total of four instruction words 504 available in the IBQ 502 and instructions L4, L5
[...] instruction words 504 are fetched into the IBQ 502 during slot 614. The WPC has now increased by 16 packets of
2 x instruction words 504, and thus the IBQ 502 is full and all the loop body has been fetched. Thus, as can be seen,
the WPC count for slot 616 remains at PC+16 for [...] originating from the pre-fetch of slot 614.
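
The word counts given for slots 604 to 614 can be reproduced with a small queue model. The following Python sketch is illustrative only and is not part of the disclosed embodiment. The names Slot, simulate, SLOTS and TRACE are hypothetical. The per-slot figures are taken from the description: two words are fetched per slot, the pre-fetch is halted in slot 608, the local loop instruction occupies one word, and each parallel pair of loop instructions occupies three words.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str   # slot reference numeral used in the description
    need: int   # words the decoders require in this slot
    fetch: int  # words fetched into the IBQ during this slot


def simulate(available, slots):
    """Return (slot name, words available at slot start, dispatched?) per slot."""
    trace = []
    for s in slots:
        dispatched = available >= s.need
        trace.append((s.name, available, dispatched))
        if dispatched:
            available -= s.need  # read pointer advances past dispatched words
        available += s.fetch     # write pointer advances past fetched words
    return trace


# Assumed schedule: values read from the description of slots 604 to 614.
SLOTS = [
    Slot("604", need=1, fetch=2),  # 16-bit local loop instruction
    Slot("606", need=3, fetch=2),  # first 48-bit parallel pair
    Slot("608", need=3, fetch=0),  # pre-fetch halted for this slot
    Slot("610", need=3, fetch=2),
    Slot("612", need=3, fetch=2),
    Slot("614", need=3, fetch=2),
]

TRACE = simulate(3, SLOTS)  # three words are available at slot 604

for name, avail, dispatched in TRACE:
    print(name, avail, "dispatch" if dispatched else "unused")
```

Under these assumptions the model shows no dispatch in slots 610 and 612, which corresponds to the two unused slots noted for the first pass of the loop body.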

[0099] [...] 32 words available in the IBQ. This is the maximum size of the IBQ 502, and hence the fetch is switched
off for further slots 618, 620 onwards forming further iterations of the loop.

[0100] For the last iteration of the loop, the fetch is switched back on in slot 626 in order to top up the IBQ 502 to
avoid any gaps in the queue.

[0101] Thus, for the body of the loop, excluding the first and last iteration, there is no pipeline fetch stage. Thus there
is no program memory access. This reduces power consumption during the loop compared to conventional loops
since fewer program memory accesses are performed.
[0102] Thus, in accordance with an embodiment of the invention, the microprocessor is configured to respond to a
local repeat instruction which provides for an iterative looping through a set of instructions all of which are contained
in the Instruction Buffer Queue 502. Referring again to Figure 5, the IBQ 502 is 64 bytes long and is organised into
32 x 16-bit words. Instructions are fetched into IBQ 502 two words at a time. Additionally, the Instruction Decoder
Controller reads a packet of up to six program code bytes into the instruction decoders 512 and 514 for each Decode
Stage of the pipeline. The start and end of the loop, i.e. the first and last instructions, may fall at any of the byte boundaries
within the four-byte packet of program code fetched to the IBQ 502. Thus, the start (first) and end (last) instructions are
not necessarily co-terminous with the top and bottom of IBQ 502. For example, in a case where the local loop instruction
spans two bytes across the boundary of a packet of four program codes, both packets of four program codes must
be retained in the IBQ 502 for execution of the local loop repeat. In order to take this into account the local loop
instruction offset is a maximum of 55 bytes.
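
The effect of byte alignment on the number of four-byte packets that must be retained can be illustrated as follows. This Python sketch is an assumption-laden illustration, not the disclosed circuitry: the function names packets_spanned and fits_in_ibq are hypothetical, and the sketch only counts packets against the 64-byte queue size. It does not derive the 55-byte maximum offset, which is taken from the description as given.

```python
PACKET_BYTES = 4   # program code is fetched into the IBQ four bytes at a time
IBQ_BYTES = 64     # 32 x 16-bit words


def packets_spanned(start_byte, length_bytes):
    """Number of four-byte packets touched by a loop body of length_bytes
    whose first byte sits at address start_byte."""
    first = start_byte // PACKET_BYTES
    last = (start_byte + length_bytes - 1) // PACKET_BYTES
    return last - first + 1


def fits_in_ibq(start_byte, length_bytes):
    """True if every packet touched by the loop body can be held at once."""
    return packets_spanned(start_byte, length_bytes) * PACKET_BYTES <= IBQ_BYTES


print(packets_spanned(3, 2))   # two bytes straddling a packet boundary
print(fits_in_ibq(0, 64))      # aligned 64-byte body
print(fits_in_ibq(1, 64))      # same body, misaligned by one byte
```

A two-byte instruction that straddles a packet boundary occupies two packets, so a misaligned loop body needs more queue space than its byte length alone suggests.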

[0103] During the first iteration of a local loop, the program code for the body of the loop is loaded into the IBQ 502 
and executed as usual. However, for the following iterations no fetch will occur until the last iteration during which the
fetch will restart. 
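
The reduction in program memory accesses can be estimated with a simple count. The Python sketch below is a rough model under stated assumptions, not a measurement: the function name fetch_accesses is hypothetical, each access is assumed to deliver two instruction words, a conventional loop is assumed to re-fetch the whole body on every iteration, and the top-up fetch during the last iteration is ignored because it brings in code that follows the loop.

```python
import math

WORDS_PER_FETCH = 2  # instructions are fetched into the IBQ two words at a time


def fetch_accesses(body_words, iterations, local_loop):
    """Approximate program memory accesses needed to supply the loop body."""
    per_pass = math.ceil(body_words / WORDS_PER_FETCH)
    # A local loop fetches the body once; later iterations run from the IBQ.
    return per_pass if local_loop else per_pass * iterations


print(fetch_accesses(30, 100, local_loop=False))
print(fetch_accesses(30, 100, local_loop=True))
```

In this model the saving grows linearly with the iteration count, since only the first pass of a local loop touches program memory.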

[0104] In another embodiment, the microprocessor is configured to align instruction words in the IBQ 502 in order to
maximize the block size for a local loop. The alignment of the instruction words may operate to place start and end 
instructions for a local loop as close to respective boundaries of the IBQ 502 as possible. An embodiment of the 
assembler configures the alignment of instructions in the IBQ 502 to maximize the block size for a local loop. 
[0105] Referring again to Figure 1, fabrication of data processing device 10 involves multiple steps of implanting
various amounts of impurities into a semiconductor substrate and diffusing the impurities to selected depths within the 
substrate to form transistor devices. Masks are formed to control the placement of the impurities. Multiple layers of 
conductive material and insulative material are deposited and etched to interconnect the various devices. These steps 
are performed in a clean room environment. 

[0106] A significant portion of the cost of producing the data processing device involves testing. While in wafer form, 
individual devices are biased to an operational state and probe tested for basic operational functionality. The wafer is
then separated into individual dice which may be sold as bare die or packaged. After packaging, finished parts are 
biased into an operational state and tested for operational functionality. 

[0107] An alternative embodiment of the novel aspects of the present invention may include other circuitries that are 
combined with the circuitries disclosed herein in order to reduce the total gate count of the combined functions. Since 
those skilled in the art are aware of techniques for gate minimization, the details of such an embodiment will not be 
described herein. 
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[0108] Thus, there has been described a processor that is a programmable digital signal processor (DSP), offering 
both high code density and easy programming. Architecture and instruction set are optimized for low power consump- 
tion and high efficiency execution of DSP algorithms, such as for wireless telephones, as well as pure control tasks. 
The processor includes an instruction buffer unit, and a data computation unit for executing the instructions decoded 
by the instruction buffer unit. Instructions can be executed in a parallel manner, either in response to implicit parallelism
or in response to user defined parallelism. 

[0109] Partitioning of the instruction decoder for several instruction groups allows one or more of the decoder partitions
to remain idle during execution of an instruction loop. Consequently there is a corresponding reduction in power
consumption by the microprocessor. Advantageously, partitioning of other portions of the processor and inhibiting
operation of selected partitions further reduces power consumption of the processor.

[0110] In view of the foregoing description it will be evident to a person skilled in the art that various modifications 
may be made within the scope of the invention. For example, the instructions comprising the body of the loop need 
not be full 48-bit parallel instructions, or even parallel instructions at all. Additionally, the loop need not take up all of
the IBQ, but may be smaller than that described above. In another embodiment, an IBQ is not provided. In another
embodiment, the program memory comprises a memory cache. In alternative embodiments, the instruction decoder
may be partitioned across a number of pipeline stages, or be included completely within one pipeline stage. 
[0111] A method for operating a digital system comprising a microprocessor is provided, wherein the method com- 
prises the steps of: partitioning a portion of the microprocessor into a plurality of partitions; executing a sequence of 
instructions within an instruction pipeline of the microprocessor; repetitively executing a block of instructions within the 
sequence of instructions, wherein the block has an initial instruction and a final instruction; determining that at least 
one of the plurality of partitions is not needed to execute the block of instructions; and inhibiting operation of the at 
least one partition during the step of repetitively executing the block of instructions, whereby power dissipation is 
reduced. 
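
The determining and inhibiting steps can be pictured as set arithmetic over the partitions. The Python sketch below is a hedged illustration only: the partition names, the USES table and the mnemonics are hypothetical and do not correspond to any instruction set or partitioning described herein.

```python
from enum import Flag, auto


class Partition(Flag):
    # Hypothetical partitions of a portion of the microprocessor.
    ARITH = auto()
    LOGIC = auto()
    MEMORY = auto()
    CONTROL = auto()


ALL = Partition.ARITH | Partition.LOGIC | Partition.MEMORY | Partition.CONTROL

# Hypothetical mapping from instruction mnemonic to the partition it needs.
USES = {
    "add": Partition.ARITH,
    "mac": Partition.ARITH,
    "and": Partition.LOGIC,
    "ld": Partition.MEMORY,
    "st": Partition.MEMORY,
    "br": Partition.CONTROL,
}


def partitions_to_inhibit(block):
    """Return the partitions that no instruction in the block needs."""
    needed = Partition(0)
    for mnemonic in block:
        needed |= USES[mnemonic]
    return ALL & ~needed


print(partitions_to_inhibit(["ld", "mac", "st"]))
```

For a block containing only load, multiply-accumulate and store instructions, the sketch reports the logic and control partitions as candidates for inhibition during the repeated execution.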

[0112] In a further embodiment of the method, the step of partitioning comprises partitioning an instruction register
of the microprocessor in accordance with different instruction lengths; wherein the step of determining comprises
determining a maximum instruction length of instructions within the block of instructions; and the step of inhibiting
comprises inhibiting loading of one or more of the instruction register partitions in accordance with the determined maximum
instruction length.
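
As a hedged illustration of inhibiting register partitions by instruction length, the Python sketch below assumes an instruction register of six one-byte partitions. The register width, the one-byte partition size and the function name load_enables are assumptions made for the example only.

```python
REGISTER_BYTES = 6  # assumed register width, one partition per byte


def load_enables(instruction_lengths):
    """Return one load-enable flag per register partition.

    Only the partitions needed by the longest instruction in the block
    are loaded; the remaining partitions are inhibited.
    """
    longest = max(instruction_lengths)
    if longest > REGISTER_BYTES:
        raise ValueError("instruction longer than the register")
    return [lane < longest for lane in range(REGISTER_BYTES)]


print(load_enables([2, 3, 2]))
```

With a maximum instruction length of three bytes in the block, the upper three partitions are never loaded while the block repeats.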

[0113] In a further embodiment of the method, the step of partitioning comprises partitioning the instruction pipeline
in accordance with parallel instruction execution; wherein the step of determining comprises determining a maximum
instruction parallelism of instructions within the block of instructions; and the step of inhibiting comprises inhibiting one
or more parallel instruction execution partitions.

[0114] In a further embodiment of the method, the step of partitioning comprises partitioning a portion of the
microprocessor in accordance with data types; the step of determining comprises determining one or more data types not
used within the block of instructions; and the step of inhibiting comprises inhibiting one or more data type partitions.
[0115] In a further embodiment of the method, the step of determining comprises determining that updating of status
circuitry is not required within the block of instructions; and the step of inhibiting comprises inhibiting updating of the
status circuitry.

[0116] In a further embodiment of the method, the step of partitioning comprises partitioning address generation 
circuitry of the microprocessor into a plurality of partitions in accordance with address modes; the step of determining
comprises determining one or more address modes not used within the block of instructions; and the step of inhibiting 
comprises inhibiting one or more address generation partitions. 

[0117] In a further embodiment of the method, the step of determining further comprises first monitoring execution 
of a first iteration of the block of instructions and thereby deriving the repeat profile parameter.
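
Deriving the repeat profile parameter by monitoring the first iteration can be sketched as follows. This Python example is illustrative and rests on assumptions: instruction groups are represented by small integer indices, the repeat profile parameter is modelled as a bit mask in which a set bit marks a group that may be inhibited, and the function name run_with_monitor is hypothetical.

```python
def run_with_monitor(block_groups, iterations, n_groups):
    """Model a repeated block whose profile is learned on the first pass.

    block_groups: group index of each instruction in the block.
    Returns (profile, enabled_log), where enabled_log holds the mask of
    enabled groups for each iteration.
    """
    all_groups = (1 << n_groups) - 1
    used = 0
    profile = 0  # set bit = group may be inhibited
    enabled_log = []
    for i in range(iterations):
        enabled_log.append(all_groups & ~profile)
        if i == 0:
            # First iteration: everything is enabled while usage is recorded.
            for g in block_groups:
                used |= 1 << g
            profile = all_groups & ~used
    return profile, enabled_log


profile, log = run_with_monitor([0, 2, 2], iterations=3, n_groups=4)
print(bin(profile), [bin(m) for m in log])
```

The first iteration runs with every group enabled, and only the second and later iterations benefit from the derived profile.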

[0118] In a further embodiment of the method, the step of storing a repeat profile parameter comprises storing a first
repeat profile parameter representative of an inner loop and storing a second repeat profile parameter representative
of an outer loop; and the step of inhibiting comprises inhibiting operation of a first partition of the microprocessor during
execution of the inner loop, and inhibiting operation of a second partition of the microprocessor during execution of the
outer loop.

[0119] A further embodiment of the method comprises the steps of: interrupting the step of repetitively executing
a block of instructions to execute an interrupt service routine (ISR); masking partition inhibition so that all partitions of
the microprocessor are enabled during execution of the ISR; and unmasking partition inhibition when returning to
repetitive execution of the block of instructions after execution of the ISR is completed.

[0120] In a further embodiment of this method, the step of masking partition inhibition comprises masking a repeat
profile parameter.
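
Masking the repeat profile parameter during an interrupt service routine can be modelled with a single gating term. The Python sketch below is a minimal illustration under assumptions: four partitions are represented by a 4-bit mask, a set bit in the repeat profile parameter marks a partition to inhibit, and the names ALL_PARTITIONS and enabled_partitions are hypothetical.

```python
ALL_PARTITIONS = 0b1111  # assumed: four partitions, one bit each


def enabled_partitions(repeat_profile, in_isr):
    """Return the mask of enabled partitions.

    repeat_profile has a set bit for every partition to inhibit. While an
    ISR executes, the profile is masked so that nothing is inhibited.
    """
    gate = 0 if in_isr else ALL_PARTITIONS
    return ALL_PARTITIONS & ~(repeat_profile & gate)


print(bin(enabled_partitions(0b1010, in_isr=False)))
print(bin(enabled_partitions(0b1010, in_isr=True)))
```

Because the stored profile itself is not altered, clearing the gate on return from the ISR restores the same inhibition that was in force before the interrupt.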

[0121] Advantageously, aspects of the present invention may be combined with other techniques for power management
within a processor to further reduce power consumption of a processor. For example, various functional units
may be placed in a standby mode during loop execution if a functional unit is not used by any of the instructions in the 

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loop. 

[0122] The scope of the present disclosure includes any novel feature or combination of features disclosed therein
either explicitly or implicitly or any generalization thereof irrespective of whether or not it relates to the claimed invention
or mitigates any or all of the problems addressed by the present invention.

[0123] As used herein, the terms "applied," "connected," and "connection" mean electrically connected, including
where additional elements may be in the electrical connection path. "Associated" means a controlling relationship,
such as a memory resource that is controlled by an associated port. The terms assert, assertion, de-assert, de-assertion,
negate and negation are used to avoid confusion when dealing with a mixture of active high and active low signals.
Assert and assertion are used to indicate that a signal is rendered active, or logically true. De-assert, de-assertion,
negate, and negation are used to indicate that a signal is rendered inactive, or logically false.

[0124] While the invention has been described with reference to illustrative embodiments, this description is not
intended to be construed in a limiting sense. Various other embodiments of the invention will be apparent to persons
skilled in the art upon reference to this description. For example, various portions of the processor can be partitioned
into a set of partitions, as described herein. In a given embodiment, any one or more sets of partitions can be provided
and controlled by a single or by multiple repeat profile parameters. 
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Claims 

1. A method of operating a digital system comprising:

partitioning a portion of a microprocessor into a plurality of partitions;

executing a sequence of instructions within an instruction pipeline of the microprocessor;

repetitively executing a block of instructions within the sequence of instructions, wherein the block has an
initial instruction and a final instruction;

determining that at least one of the plurality of partitions is not needed to execute the block of instructions; and

inhibiting operation of the at least one partition during the step of repetitively executing the block of instructions,
whereby power dissipation is reduced.

2. The method according to Claim 1, wherein the step of determining comprises the step of storing a repeat profile
parameter indicative of the at least one partition that is not needed to execute the block of instructions.

3. The method according to any preceding claim, wherein: 

(a) the step of partitioning comprises partitioning an instruction decoder into a plurality of partitions such that 
each partition of the instruction decoder is associated with a group of instructions; 

(b) the step of determining comprises storing a repeat profile parameter indicative of at least a first group of 
instructions not contained within the block of instructions; and 

(c) the step of inhibiting operation of a portion of an instruction decoder further comprises inhibiting operation 
of a partition of the instruction decoder corresponding to the first group of instructions.

4. The method according to any preceding claim, wherein the step of inhibiting comprises inhibiting a first partition 
of the instruction decoder associated with a first stage of the pipeline and inhibiting a second partition of the
instruction decoder associated with a second stage of the instruction pipeline.

5. The method according to any preceding claim, wherein: 

(a) the step of partitioning comprises partitioning an instruction decoder into a plurality of partitions such that 
each partition of the instruction decoder is associated with a group of instructions; 

(b) the step of determining comprises identifying a group of instructions that are inherently forbidden from 
being executed during repetitive execution of a block of instructions; and 

(c) the step of inhibiting comprises inhibiting operation of a partition of the instruction decoder corresponding
to the forbidden group of instructions during the step of repetitively executing the block of instructions while a 
remainder of the instruction decoder decodes the block of instructions as they are executed in the pipeline. 

6. The method according to any preceding claim, wherein the step of determining further comprises first receiving 
the repeat profile parameter as a parameter associated with a prologue instruction prior to the step of repetitively 
executing the block of instructions. 
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7. The method of Claim 6 further comprising the step of: 

assembling a source code program to create the sequence of instructions comprising the block of instructions, the
prologue instruction and the associated repeat profile parameter, the step of assembling comprising:

(a) creating an instruction table with an entry for each instruction executable by a selected microprocessor, 
such that the entry for each instruction includes a group pattern defining a group of instructions that includes 
that instruction; 

(b) transforming the source code into a sequence of instructions; 

(c) determining the initial instruction and the final instruction for the repeatable block of instructions associated 
with the prologue instruction; 

(d) combining a plurality of group patterns selected from the instruction table representative of each instruction 
in the block of instructions to form a repeat profile parameter; and 

(e) associating the repeat profile parameter with the prologue instruction. 

8. A method of assembling a source code program for creating a sequence of instructions, the sequence of instructions
having a repeatable block of instructions including an initial instruction and a final instruction, the method
comprising:



creating an instruction table with an entry for each instruction executable by a selected microprocessor, such
that the entry for each instruction includes a group pattern defining a group of instructions that includes that 
instruction; 

transforming the source code into a sequence of instructions; 

determining the initial instruction and the final instruction for a first repeatable block of instructions associated 
with a first prologue instruction; 

combining a plurality of group patterns selected from the instruction table representative of each instruction 
in the first block of instructions to form a first repeat profile parameter; and 

associating the first repeat profile parameter with the first prologue instruction in the sequence of instructions. 
9. The method of Claim 8 further comprising the steps of: 

determining the initial instruction and the final instruction for a second repeatable block of instructions associated
with a second prologue instruction;

combining a plurality of group patterns selected from the instruction table representative of each instruction
in the second block of instructions to form a second repeat profile parameter; and

associating the second repeat profile parameter with the second prologue instruction in the sequence of in- 
structions. 



10. A digital system including a microprocessor, the microprocessor comprising:

a pipeline having a plurality of stages for executing instructions dispatched thereto;

an instruction decoder for decoding the instructions, the instruction decoder being controllably connected to 
the pipeline, wherein the instruction decoder is partitioned into a plurality of partitions according to a respective
plurality of instruction groups, at least one of the partitions having an inhibit input; 

block repeat control circuitry responsive to a prologue instruction to initiate iterative execution of a block of 
instructions, said block including an initial instruction and a final instruction; and 

wherein the instruction decoder is operable to decode certain instruction groups and further operable to inhibit
decoding of at least one instruction group in response to the block repeat circuitry, whereby power consumed 
by the instruction decoder is reduced during repeated execution of the block of instructions. 

11. The digital system according to Claim 10, wherein the block repeat control circuitry further comprises: 

repeat profile circuitry connected to receive a repeat profile parameter, an output of the repeat profile circuitry
being connected to the enable input of the at least one instruction decoder partition; and

wherein the instruction decoder is operable to decode certain instruction groups and further operable to inhibit
decoding of at least one instruction group in response to a first repeat profile parameter in the repeat profile
circuitry.

12. The digital system according to Claim 10 or Claim 11, wherein the instruction decoder is hierarchical, such that a
first portion of the instruction decoder is associated with a first stage of the pipeline and a second portion of the
instruction decoder is associated with a second stage of the pipeline; and

wherein at least a first instruction decoder partition in the first portion of the instruction decoder has a first
inhibit input connected to a first output of the repeat profile circuitry and at least a second instruction decoder
partition in the second portion of the instruction decoder has a second inhibit input connected to a second output
of the repeat profile circuitry.

13. The digital system according to any of Claims 10 to 12, wherein the repeat profile circuitry is connected to receive
a repeat profile parameter provided by the prologue instruction of the block of instructions.

14. The digital system according to any of Claims 10 to 13, wherein the repeat profile circuitry is connected to receive
a repeat profile parameter provided by monitoring circuitry coupled to the instruction decoder, wherein the monitoring
circuitry is operable to monitor the instruction decoder during a first iteration of a first block of instructions
and to derive a repeat profile parameter indicative of at least a first group of instructions not included
within the first block of instructions.

15. The digital system according to any of Claims 10 to 14, wherein the repeat profile circuitry is operable to receive
two repeat profile parameters representative of an inner loop and an outer loop, such that the instruction decoder
is operable to inhibit decoding of a first instruction group during execution of the inner loop and to inhibit decoding
of a second instruction group during execution of the outer loop.

16. The digital system according to any of Claims 10 to 15, further comprising:

an instruction buffer for temporarily storing the block of instructions transferred thereto prior to dispatch to the
execution unit;

wherein the pipeline comprises an instruction fetch stage for fetching instructions from a program memory for
transfer into the instruction buffer; and

wherein the block repeat control circuitry is operable to inhibit the instruction fetch stage subsequent to fetching
the final instruction of the block of instructions from the program memory into the instruction buffer.

17. The digital system according to any of Claims 10 to 16, further comprising:

an instruction register for holding each instruction prior to being decoded by the instruction decoder; and

wherein the instruction register is operable to be partially inhibited in response to the repeat profile circuitry
during execution of the block of instructions.

18. The digital system according to Claim 17, wherein said instructions are of a variable length. 

19. The digital system according to any of Claims 10 to 18, said system comprising a cellular telephone comprising: 

an integrated keyboard connected to the processor via a keyboard adapter; 
a display, connected to the processor via a display adapter; 
radio frequency (RF) circuitry connected to the processor; and 
an aerial connected to the RF circuitry. 
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